Generalised Linear Models

One of the assumptions of linear models is that the residuals are normally distributed. This assumption is often violated with certain types of data that are bounded at particular values or are discrete values. Count data (integers) are a perfect example of data that is both bounded and discrete. Historically count data had been log or square root transformed and fit assuming it met the assumption of normally distributed residuals. This can work in some cases, but it has the potential to create nonsensical predictions like negative counts.

An alternative modeling approach is increasingly encouraged in introductory statistics and data analysis courses - generalised linear models. Rather than spend too much time here going into the math and statistics behind this approach, we are going to focus on the application of this tool and cover how to use the appropriate code. There are helpful resources that can explain the math and theory behind this approach including this online book by John Fieberg at the University of Minnesota.

The generalised linear model is similar to the linear model in that both models have the data - dependent variable (y), the independent variables (x), the estimates of the relationship between the independent and dependent variables (known as beta estimates) and an error component. The additional component for the generalised linear model is the link function which allows for the data in the model to be bounded and have changes in the variance - violations of assumptions from the linear model. The link function will appear as a specific argument in the glm() function we will use to fit a generalised linear model.

Count Data in R

The first example data set we will use data that came from a trial of different insecticide applications and the observed number of insects after the insecticides had been applied. The four sprays A, B, C, and F are the focus of this analysis and are filtered from the original data set.

Loading the Data

library(readr)
library(here)
## here() starts at /Users/conorfair/Library/CloudStorage/OneDrive-UniversityofGeorgia/Active Projects/Data Analysis in R
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Counts <- read_csv(here("data", "Count_Model.csv"), col_types = "df")
str(Counts)
## spc_tbl_ [72 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ count: num [1:72] 10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   count = col_double(),
##   ..   spray = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )
##  - attr(*, "problems")=<externalptr>
# Filter only the A, B, C, and F insecticides 
Counts <- Counts %>%
  filter(spray == 'A' | spray == 'B' | spray == 'C' | spray == 'F') %>%
  droplevels()

Filtering out the over levels of insecticides (spray) can be done using the filter() function, but the other levels are kept within the history of the Counts object until you use the droplevels() function.

Visualizing the Data

ggplot(Counts,aes(x = spray, y = count)) +
  geom_boxplot(width = 0.5) +
  geom_jitter(height = 0, width = 0.1, size = 3, pch = 21) +
  labs(x = "Spray Treatment Type", y = "Count") +
  theme_classic() +
  theme(axis.title = element_text(face = "bold", size = 15),
        axis.text = element_text(face = "bold", size = 15))

We included more design elements in ggplot when visualizing your data. The geom_jitter() function allows you to shift either the height or width of the points to prevent the individual points from being plotted on top of each other. The size and point shape is also adjusted. Lastly the axis title and text is modified in the theme() function to make the text larger and bold faced font. This highlights another layer of modifications that you can make to the aesthetics of these plots.

Formulate Hypotheses

From here we can formulate the question and hypothesis: is there variation in the number of insects observed among the different insecticide applications?

Null hypothesis: there is no variation in the number of insects observed among the different insecticide applications Alternative hypothesis: there is at least one difference in the number of insects observed between the different insecticide applications

Fit Linear Model and Test Assumptions

We are going to illustrate the differences between fitting a linear model on count data, log-transformed count data, and a generalised linear model with the log-link function and an assumed Poisson distribution.

mod_lm <- lm(count ~ spray, data = Counts)

# mod_log <- lm(log(count) ~ spray, data = Counts) log of zero is undefined
# log1p adds 1 to each integer value so log(1) is zero

mod_log <- lm(log1p(count) ~ spray, data = Counts)
mod_glm <- glm(count ~ spray, family = poisson(link = log), data = Counts)

library(performance)
check_model(mod_lm, detrend = F)

check_model(mod_log, detrend = F)

check_model(mod_glm, detrend = F)

Comparing the residuals among the three different models we see considerable improvements in the homogeneity of variance and normally distributed residual plots for the glm model.

You may have noticed an additional graph from the check_model() output where we were checking the assumptions of the count model fit with a poisson distribution and a link function. Zero-inflation is a potential issue with count models (among others), and we will briefly discuss this during the Advanced Modeling Example section.

Interpret Model Results

car::Anova(mod_glm, test.statistic = "F") # type doesn't matter here
## Analysis of Deviance Table (Type II tests)
## 
## Response: count
## Error estimate based on Pearson residuals 
## 
##            Sum Sq Df F value    Pr(>F)    
## spray     185.831  3  35.832 7.037e-12 ***
## Residuals  76.064 44                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We need to include a new argument to request for the F statistic
summary(mod_glm)
## 
## Call:
## glm(formula = count ~ spray, family = poisson(link = log), data = Counts)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.67415    0.07581  35.274   <2e-16 ***
## sprayB       0.05588    0.10574   0.528    0.597    
## sprayC      -1.94018    0.21389  -9.071   <2e-16 ***
## sprayF       0.13926    0.10367   1.343    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.154  on 47  degrees of freedom
## Residual deviance:  76.323  on 44  degrees of freedom
## AIC: 273.93
## 
## Number of Fisher Scoring iterations: 5
# Estimate for spray c
exp(2.67415-1.94018)
## [1] 2.083335

We refute the null hypothesis that there is no variation in the number of insects observed among the different insecticide applications (F (3,44) = 35.832, p value < 0.001). Because the GLM uses the link function the model results are reported on the scale of the link function - the log scale in this case. The inverse of the log link function is the exp() function. For categorical variables it is easier to look at the estimated marginal means as there is an argument to report the inverse link response data.

Figure to Explain Results

The additional argument type = "response" produces the estimated marginal means on the original scale using the inverse function of the link function used in the model. We used the log link function in this model.

The calculation of the “rate” or estimated marginal mean count and the upper and lower 95% confidence intervals are calculated correctly, but the standard error is not. We will use the expand.grid() and predict() functions to produce the same data as the emmeans() function, but we have to use the inverse link function to calculate the standard errors correctly. The emmeans() standard errors are replicated by using the type = response in the predict() function rather than using the inverse link function after the standard errors are calculated from the predict() function.

library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
# This approach can be used if you want to report the mean and 95% confidence intervals

emmeans_glm <- mod_glm %>%
  emmeans(specs = "spray", type = "response")%>%
  cld(Letters = letters, decreasing = TRUE) %>%
  as_tibble()

SprayNew_response = factor(c("A", "B", "C", "F"))
dataPredict_response <- expand.grid(spray = SprayNew_response)
predictions_response <- predict(mod_glm, newdata = dataPredict_response, se.fit = TRUE, type = "response")
predictions_response <- as.data.frame(predictions_response)
NewData_response <- cbind(dataPredict_response,predictions_response)
head(NewData_response)
##   spray       fit    se.fit residual.scale
## 1     A 14.500000 1.0992422              1
## 2     B 15.333333 1.1303883              1
## 3     C  2.083333 0.4166664              1
## 4     F 16.666667 1.1785113              1
# Use this approach if you want to report mean and standard error

SprayNew = factor(c("A", "B", "C", "F"))
dataPredict <- expand.grid(spray = SprayNew)
predictions <- predict(mod_glm, newdata = dataPredict, se.fit = TRUE)
predictions <- as.data.frame(predictions)
NewData <- cbind(dataPredict,predictions)
NewData$UI <- exp(NewData$fit + (1.96*NewData$se.fit))
NewData$LI <- exp(NewData$fit - (1.96*NewData$se.fit))
NewData$se.fit <- exp(NewData$se.fit)
NewData$fit <- exp(NewData$fit)

# Arrange rows by decreasing values of fit to match emmeans_glm object so that cld groups are attached in the correct order

NewData <- NewData %>%
  arrange(-fit)

# Attach cld groups to NewData object for figure

NewData <- cbind(NewData, emmeans_glm[,7])

Plot_glm <- ggplot() +
  geom_point(data = Counts, aes(y = count, x = spray), position = position_jitter(width = 0.1)) +
  geom_boxplot(data = Counts, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25) +
  geom_point(data = NewData, aes(y = fit, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") +
  geom_errorbar(data = NewData, aes(ymin = fit - se.fit, ymax = fit + se.fit, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) +
  geom_text(data = NewData, aes(y = fit, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + 
  labs(y = "Counts \u00B1(SE)", x = "Insecticide Treatment") +
  theme_classic()
Plot_glm

ggsave(here("outputs", "Count_Model.tiff"), Plot_glm, width = 20, height = 15, units = "cm", dpi = 300)

This figure required a few more steps to ensure that we reported the correct standard errors along with the estimated means. You saw a few more functions to help manipulate the data to achieve this (e.g., arrange()).

Count models fit with a poisson distribution have an assumptions that the mean and variance are equal. This is not always the case, and when the variance is greater than the mean it is known as overdispersion. The presence of overdispersion can produce biased standard errors and impact the interpretation of your results. We will now review the procedures to test for overdispersion and means to address overdispersion, when present.

Overdispersion in Count Data

Testing for overdispersion begins with fitting a generalised linear model with a poisson distribution. We will use the data set beall.webworms from the agridat package. The researchers observed counts of webworms in a beet field with four different insecticide treatments.

Loading the Data

Webworms <- read_csv(here("data", "Overdispersion.csv"), col_types = "dddffff")
str(Webworms)
## spc_tbl_ [1,300 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ row  : num [1:1300] 1 2 3 4 5 6 7 8 9 10 ...
##  $ col  : num [1:1300] 1 1 1 1 1 1 1 1 1 1 ...
##  $ y    : num [1:1300] 1 0 1 3 6 0 2 2 1 3 ...
##  $ block: Factor w/ 13 levels "B1","B2","B3",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ trt  : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   row = col_double(),
##   ..   col = col_double(),
##   ..   y = col_double(),
##   ..   block = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   trt = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   spray = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   lead = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )
##  - attr(*, "problems")=<externalptr>

Visualizing the Data

ggplot(Webworms, aes(x = spray, y = y, fill = lead)) +
  geom_violin(scale = "width", adjust = 2) + 
  geom_point(position = position_jitterdodge(jitter.width = 0.5, jitter.height = 0.1, dodge.width = 0.8), alpha = 0.1) +
  labs(x = "Spray Treatment", y = "Webworm Count", fill = "Lead Treatment") +
  theme_classic()

Formulate Hypotheses

Since the four treatments are coded as the presence or absence of both the spray and lead treatments there will need to be an interaction with the spray and lead terms.

Question: is there variation in the number of webworms observed among the different treatment combinations?

Null hypothesis: there is no variation in the number of webworms observed between the spray or lead treatments Alternative hypothesis: there is at least one difference in the number of observed webworms between the different spray and/or lead treatments.

Fit Linear Model and Test Assumptions

mod_webworms <- glm(y ~ spray * lead, family = poisson(link = log), data = Webworms)

check_model(mod_webworms, detrend = F)

check_overdispersion(mod_webworms)
## # Overdispersion test
## 
##        dispersion ratio =    1.355
##   Pearson's Chi-Squared = 1755.717
##                 p-value =  < 0.001
## Overdispersion detected.

The assumptions of the residuals for this model look good, but there is a different function to test for overdispersion from the performance package. The dispersion ratio is an estimate of the amount of overdispersion in the model fit with poisson distribution. There is a test statistic you can report as evidence for overdispersion.

The next step is to account or model for this increased variance the poisson distribution cannot handle. Another distribution - the negative binomial distribution estimates another parameter (theta) as by how much larger is the variance than the mean. We can use the function glm.nb() from the MASS package or the glmmTMB() function from the DHARMa package. Since the glmmTMB() function has many more uses - including some of what we will discuss in the Advanced Modeling section - this is the function we will be using.

Re-fit Model to Account for Overdispersion

library(glmmTMB)
## Warning in check_dep_version(dep_pkg = "TMB"): package version mismatch: 
## glmmTMB was built with TMB package version 1.9.15
## Current TMB package version is 1.9.16
## Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
mod_nb <- glmmTMB(y ~ spray * lead, data = Webworms, family = "nbinom2")

check_model(mod_nb, detrend = F)
## `check_outliers()` does not yet support models of class `glmmTMB`.

library(DHARMa)
## This is DHARMa 0.4.7. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
resid <- simulateResiduals(mod_nb)
plot(resid)

Note: there are other families similar to “nbinom2” you may see in the help page for glmmTMB, but they have minor differences. This is the recommended family for most count models with overdispersion.

Another approach to assess the residuals of a model is with the simulated quantile residuals with the simulateResiduals() function from the DHARMa package. The QQ plot and homogeneity of variance plots are interpreted similarly as the plots from the check_model() output. These plots are more difficult to interpret with small sample sizes and should be interpreted cautiously.

Interpret Model Results

summary(mod_nb)
##  Family: nbinom2  ( log )
## Formula:          y ~ spray * lead
## Data: Webworms
## 
##      AIC      BIC   logLik deviance df.resid 
##   3053.0   3078.8  -1521.5   3043.0     1295 
## 
## 
## Dispersion parameter for nbinom2 family ():    2 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.33647    0.06110   5.507 3.65e-08 ***
## sprayY       -1.02043    0.10661  -9.572  < 2e-16 ***
## leadY        -0.49628    0.09423  -5.267 1.39e-07 ***
## sprayY:leadY  0.29425    0.15972   1.842   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#car::Anova(mod_nb, test.statistic = "F")
# F statistics are unavailable with models fit with glmmTMB - Wald chisquare tests perform similar tests
car::Anova(mod_nb)
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: y
##               Chisq Df Pr(>Chisq)    
## spray      125.5047  1  < 2.2e-16 ***
## lead        26.8005  1  2.256e-07 ***
## spray:lead   3.3942  1    0.06542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction between spray and lead is not significant at alpha = 0.05. We can review the estimated marginal means to better understand the differences between the treatment groups.

Figure to Explain Results

emmeans_nb <- mod_nb %>%
  emmeans(~ spray * lead, type = "response")%>%
  cld(Letters = letters, decreasing = TRUE) %>%
  as_tibble()

data_treatment <- expand.grid(spray = levels(Webworms$spray), lead = levels(Webworms$lead))

worm_predictions <- predict(mod_nb, newdata = data_treatment, se.fit = TRUE)
worm_predictions <- as.data.frame(worm_predictions)
New_Worms <- cbind(data_treatment,worm_predictions)
New_Worms$UI <- exp(New_Worms$fit + (1.96*New_Worms$se.fit))
New_Worms$LI <- exp(New_Worms$fit - (1.96*New_Worms$se.fit))
New_Worms$se.fit <- exp(New_Worms$se.fit)
New_Worms$fit <- exp(New_Worms$fit)

# Arrange rows by decreasing values of fit to match emmeans_glm object so that cld groups are attached in the correct order
New_Worms <- New_Worms %>%
  arrange(-fit)

# Attach cld groups to New_Worms object for figure - 8th column
New_Worms <- cbind(New_Worms, emmeans_nb[,8])

# Rename Lead to be capitalized for facet wrap label
Webworms <- rename(Webworms, Lead = lead)
New_Worms <- rename(New_Worms, Lead = lead)

Plot_NB<-ggplot() +
  geom_point(data = Webworms, aes(y = y, x = spray), position = position_jitter(width = 0.1)) +
  facet_wrap(~ Lead, labeller = label_both) +
  geom_boxplot(data = Webworms, aes(y = y, x = spray), position = position_nudge(x = -0.3), width = 0.25) +
  geom_point(data =New_Worms, aes(y = fit, x = spray), position = position_nudge(x = 0.2), size = 2, color = "red") +
  geom_errorbar(data = New_Worms, aes(ymin = LI, ymax = UI, x = spray), position = position_nudge(x = 0.2), color = "red", width = 0.1) +
  geom_text(data = New_Worms, aes(y = fit, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.3), color = "black", angle = 0) + 
labs(x = "Spray Treatment", y = "Webworm Count \u00B1(SE)", fill = "Lead Treatment") +
  theme_classic()
Plot_NB

ggsave(here("outputs", "Overdispersion_Model.tiff"), Plot_NB, width = 20, height = 15, units = "cm", dpi = 300)

Here the results from the Anova table indicate that the two-way interaction is not significant with the alpha level of 0.05, but the follow up analysis from the post-hoc comparisons show significant differences between each pair of spray and lead treatment combination. Because these tests are overall simpler (i.e., comparing only two treatment groups) it can detect differences that might be obfuscated by the test statistic from the Anova table.

Notes: these data have a large number of observed zeros, which can create a problem for count models. This problem is known as zero-inflation, and we will briefly discuss this problem in the Advanced Modeling section.

Logistic Regression

Another type of data that is bounded and commonly fails to meet the assumptions of normally distributed residuals is binomial data - 0, or 1 coded response (y) data. These data are often reported at proportions coming from a number of success out of a known number of trials. Similar to the count models above, there are times where certain transformations may approximate normally distributed residuals, but there is still the chance that nonsensical predictions can be made (e.g., proportions that exceed 1.0 etc.).

The distributional family used in a GLM for these data is known at the binomial distribution, and the link function for the binomial distribution is commonly chosen to be the logit link.

Binomial Data in R

The example dataset we will use to demonstrate this type of analysis follows the study of 800 observations among two pepper fields documenting the presence or absence of Phytophtera disease. The response variable is the presence or absence of the disease, and the independent variables are field and soil moisture percent we will be reviewing to determine if they impact the probability of the disease being observed.

Loading the Data

Phytophtera <- read_csv(here("data", "Logistic_Regression.csv"), col_types = "ffffdd")
str(Phytophtera)
## spc_tbl_ [800 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ field  : Factor w/ 2 levels "F1","F2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ row    : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ quadrat: Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ disease: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ water  : num [1:800] 15.1 14.3 14 13.8 13.2 ...
##  $ leaf   : num [1:800] 5 2 3 0 5 5 4 5 5 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   field = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   row = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   quadrat = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   disease = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   water = col_double(),
##   ..   leaf = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
Phytophtera$disease_num <- if_else(Phytophtera$disease == "Y", 1, 0)
str(Phytophtera)
## spc_tbl_ [800 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ field      : Factor w/ 2 levels "F1","F2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ row        : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ quadrat    : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ disease    : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ water      : num [1:800] 15.1 14.3 14 13.8 13.2 ...
##  $ leaf       : num [1:800] 5 2 3 0 5 5 4 5 5 5 ...
##  $ disease_num: num [1:800] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   field = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   row = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   quadrat = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   disease = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   water = col_double(),
##   ..   leaf = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
Phytophtera <- subset(Phytophtera, !is.na(water))

The response variable - disease is coded as “Y” or “N” for the presence or absence of the disease, but the model requires numerical values. The if_else() function is creating a 1 for any “Y” response and a 0 for any “N” response in the disease variable and creating a new variable named disease_num.

Visualizing the Data

ggplot(Phytophtera, aes(x = water, y = disease_num, color = field)) +
  geom_point() +
  theme_classic()

ggplot(Phytophtera, aes(x = water, y = disease_num, color = field)) +
  geom_jitter(height = 0.1) +
  theme_classic()

You can see the real value in using the geom_jitter() function to better see the individual points that is otherwise obscured when using the geom_point() function. Based on this figure we may see some difference in the probability of observing the disease along the gradient of soil moisture percent (water) and that this maybe different between the two fields. This leads us to testing an interaction effect.

Fomulate Hypotheses

Question: is there a relationship between the soil moisture and the probability of the disease being observed, and is this relationship the same for each field?

Null hypothesis: there is no variation in the probability of observing the disease jointly along the gradient of soil moisture and between fields Alternative hypothesis: there is some variation in the probability of observing the disease jointly along the gradient of soil moisture and between fields

Fit Linear Model and Test Assumptions

mod_binomial <- glm(disease_num ~ field * water, data = Phytophtera, family = binomial(link = "logit"))

# Inverse link for the chosen link function of the model - used for transforming predictions later
ilink <- family(mod_binomial)$linkinv

check_model(mod_binomial, detrend = F)

resid <- simulateResiduals(mod_binomial)
plot(resid)

The residuals from the check_model() function can be more difficult to interpret for assessing homogeneity of residuals. The simulated quantile residuals from the DHARMa package are easier to read. All assumptions appear to be satisfied.

Interpret Model Results

car::Anova(mod_binomial, test.statistic = "F", type = 3)
## Analysis of Deviance Table (Type III tests)
## 
## Response: disease_num
## Error estimate based on Pearson residuals 
## 
##             Sum Sq  Df F values    Pr(>F)    
## field        26.67   1  25.6585 5.080e-07 ***
## water         0.76   1   0.7297    0.3933    
## field:water  36.85   1  35.4549 3.931e-09 ***
## Residuals   818.06 787                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_binomial)
## 
## Call:
## glm(formula = disease_num ~ field * water, family = binomial(link = "logit"), 
##     data = Phytophtera)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.61224    0.82857  -3.153  0.00162 ** 
## fieldF2       -5.82953    1.18699  -4.911 9.05e-07 ***
## water          0.06673    0.07437   0.897  0.36953    
## fieldF2:water  0.61732    0.10802   5.715 1.10e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648.80  on 790  degrees of freedom
## Residual deviance: 534.19  on 787  degrees of freedom
## AIC: 542.19
## 
## Number of Fisher Scoring iterations: 6

Similar to the summary results from the count model the estimates are on the link scale (logit) this is known as the log odds, which is difficult to interpret. By using the inverse link function

Figure to Explain Results

pdat <- expand.grid(water = seq(min(Phytophtera$water), max(Phytophtera$water), length.out = 1000), field = levels(Phytophtera$field))

pred <- predict(mod_binomial, newdata = pdat, type = "link", se.fit = TRUE)
pdat <- cbind(pred,pdat)
# Inverse link function from earlier
pdat <- pdat %>%
  mutate(fitted = ilink(fit),
         upper = ilink(fit + (1.96 * se.fit)),
         lower = ilink(fit - (1.96 * se.fit)))

# Rename Field to be capitalized for facet wrap label
Phytophtera <- rename(Phytophtera, Field = field)
pdat <- rename(pdat, Field = field)


Plot_logit <- ggplot(Phytophtera, aes(x = water, y = disease_num)) +
  facet_wrap(~ Field, labeller = label_both) +
  geom_ribbon(data = pdat,
              aes(ymin = lower, ymax = upper, x = water),
              alpha = 0.2, inherit.aes = FALSE) +
  geom_point() +
  geom_line(data = pdat, aes(y = fitted, x = water)) +
  labs(x = "Soil Moisture %", y = "Probability of Observing Disease") +
  theme_classic()

ggsave(here("outputs", "Logistic_Model.tiff"), Plot_logit, width = 20, height = 15, units = "cm", dpi = 300)

emmeans_binomial <- mod_binomial %>%
  emtrends(~ field, var = "water") %>%
  cld(Letters = letters, decreasing = TRUE)

The resutls from this model can be interpreted as such:

The null hypothesis that there is no variation in the probability of observing the disease jointly along the gradient of soil moisture and between fields was refuted (F (1, 787) = 35.4549, p value < 0.001). The comparison of the estimated marginal means of linear trends using the emtrends function in the emmeans package (Lenth 2025) demonstrated that the slope estimate for the effect of soil moisture on the probability of observing the disease in Field 2 was significantly different from the slope in Field 1. Furthermore, the slope estimate for the relationship between soil moisture and the probability of observing the disease in Field 1 was not significantly different from zero.

The data set is called “Split_Plot_Design.csv” and is located in the “data” folder. These data come from (Gomez & Gomez 1984) and measure the rice yield (kg/ha) of four varieties (G) that are exposed to one of six nitrogen treatments (N) following a split-plot design. The nitrogen treatments are applied at the main plot level and the four varieties are randomly

The data set is a split-plot design where the main plot is divided into sub-plots. The main plot has two treatments (A and B) and the sub-plots have three treatments (1, 2, and 3). The data set has a categorical variable for the main plot treatment, a categorical variable for the sub-plot treatment, and a continuous variable for the yield data. The data set has 60 observations.

library(desplot)
grays <- colorRampPalette(c("white","#252525"))
desplot(Webworms, y ~ col*row|spray,
        col.regions=grays(10),
        at=0:10-0.5,
        out1=block, out2=trt, num=trt, flip=TRUE, # aspect unknown
        main="beall.webworms (count of worms)")